data = readRDS("../data/prepd.rds")
dim(data)
## [1] 30190 40
close.to.finland =
c("2", "3", "5", "6", "7", "10", "11", "12", "14", "15", "16", "19", "20",
"22", "23", "24", "27", "28", "31", "32", "36", "37", "41", "42", "45",
"46", "47", "49", "50", "51", "52", "53", "54", "55", "56", "57", "59",
"60", "61", "62", "63", "64")
data = data[data$rectangle %in% close.to.finland, ]
data = data[!is.na(data$month), ]
The proportion of missing values is negligible (< 10%) for all the
geographical and catch variables. It is significant (> 30%) for all
the variables that measure the water characteristics and for
surface.
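The proportions quoted above can be obtained with a one-line summary of `is.na()`. The sketch below uses a small made-up data frame (`toy`, with invented values) in place of the real data; only the column names and the 10% and 30% thresholds come from the text.

```r
# Toy stand-in for the real data frame: column names mirror the report,
# the values are invented for illustration.
toy = data.frame(
  herring     = c(0, 4, 0, 46, 0, 12, 3, 0, 9, 1),
  temperature = c(NA, NA, NA, 0.1, NA, 5.2, NA, 12.4, 8.0, NA),
  salinity    = c(NA, 5.1, NA, NA, 4.8, NA, NA, 5.5, NA, 6.0)
)

# Proportion of missing values in each column.
prop.missing = colMeans(is.na(toy))
print(round(prop.missing, 2))

# Split the variables using the two thresholds quoted in the text.
negligible  = names(prop.missing)[prop.missing < 0.10]
significant = names(prop.missing)[prop.missing > 0.30]
```

Replacing `toy` with `data` should give the per-variable proportions for the real data set.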
The distribution of missing values over the years is approximately constant, suggesting that the pattern of missingness is not time-dependent, or at least that it does not have a trend.
The missing values do not appear to have a strong seasonal component: their proportion remains relatively constant over the year, with only a shallow peak in winter, around New Year.
What is the geographical distribution of the missing data?
There are just three locations with proportions of missing data greater than 30%. The reason is that there are (year, month) combinations in which no species catches are measured at all, while in all the others every species is monitored. It happens in (59.75, 25.5):
| props | Freq |
|---|---|
| 0 / 16 | 234 |
| 16 / 16 | 107 |
It happens in (65.25, 23.5):
| props | Freq |
|---|---|
| 0 / 16 | 98 |
| 16 / 16 | 95 |
And it happens in (65.25, 23.5):
| props | Freq |
|---|---|
| 0 / 16 | 172 |
| 16 / 16 | 87 |
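Tables like the ones above can be built by counting, for each observation, how many species catches are missing and tabulating the result within one location. This is a minimal sketch on invented data: `toy` and its three species are hypothetical, whereas the report works with 16 species.

```r
# Hypothetical species set and toy observations (values are invented).
species = c("herring", "sprat", "cod")
toy = data.frame(
  lat     = c(59.75, 59.75, 59.75, 65.25, 65.25),
  lon     = c(25.5, 25.5, 25.5, 23.5, 23.5),
  herring = c(1, NA, 0, NA, 2),
  sprat   = c(0, NA, 3, NA, 0),
  cod     = c(5, NA, 0, NA, 1)
)

# Number of species with a missing catch in each observation.
missing.species = rowSums(is.na(toy[, species]))
props = paste(missing.species, "/", length(species))

# Tabulate the pattern for a single location.
here = toy$lat == 59.75 & toy$lon == 25.5
counts = table(props[here])
print(counts)
```

In this toy example each observation has either no missing species or all of them missing, which is the all-or-nothing pattern described in the text.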
Based on the findings above, we:
- remove the chlorophyll.a, suspended.solids and ice variables because more than 60% of their values are missing;
data = data[, setdiff(colnames(data), c("chlorophyll.a", "suspended.solids", "ice"))]
- remove the observations in which the catches of all species are missing.
missing.species = apply(data[, species], 1, function(x) sum(is.na(x)))
data = data[missing.species != length(species), ]
The yearly volume of catch across all species and locations has a cyclic pattern with a period of about 15 years. It is plotted here with 95% confidence bands.
The catches for each species over all locations evolve over time. Here they are standardised and plotted with 95% confidence bands.
The catches overall have a seasonality that comes from fishing seasons: they concentrate between April and July.
As expected, most species are caught for the most part between April and July. A few exceptions, like sprat, vendace, burbot, are caught primarily in winter. They may account for the slight increase in catch between November and January.
The Spearman correlation between species is mostly positive, which suggests that species are not competing for the same (limited) resources in the same areas.
Laura: There are two mechanisms that tie some of the species together: The ones that are most correlated (bream, roach, ide, pike, perch) all live in rather similar habitats (coastal, shallow, near islands, etc.), and they most likely also come from the same fishery: coastal gill-nets and other non-discriminating gear. On the other hand, this should also mean that the fish come to the catch in approximately the same proportions as they occur in nature (although of course how actively they move etc. is species-specific to some extent, which biases this ratio).
The only quota species here are salmon, herring, and sprat, the latter of which has not traditionally been a targeted species in Finland. There are some technical restrictions in the fishery of other species too, such as minimum landing sizes, but I would not expect them to drive the catches.
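The Spearman correlations between species discussed above can be computed with `cor()`. The sketch below is only illustrative: the catches are simulated, and the shared `effort` factor is an assumption introduced here to induce a positive correlation between two of the toy species.

```r
set.seed(1)
n = 200

# Hypothetical common driver (e.g. shared gear or habitat); not in the data.
effort = rexp(n)

# Simulated catches: perch and pike share the driver, vendace does not.
toy = data.frame(
  perch   = rpois(n, 20 * effort),
  pike    = rpois(n, 10 * effort),
  vendace = rpois(n, 5)
)

# Rank-based correlation matrix; pairwise deletion copes with missing values.
rho = cor(toy, method = "spearman", use = "pairwise.complete.obs")
print(round(rho, 2))
```

Spearman correlation only depends on ranks, so it is not affected by the very skewed scale of the catches.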
These correlations appear to change over time, suggesting that the vector time series of the catches might not be homogeneous. The most evident change is the increasing correlation between sprat and herring in the bottom left corner. The correlation between whitefish and several other species appears to increase with time.
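One simple way to look at how a correlation changes over time is to recompute it within consecutive periods. The sketch below does this by decade on simulated data; the grouping by decade and the simulated change in 2000 are assumptions made for illustration, not the procedure behind the plots.

```r
set.seed(2)
n = 400
year = sample(1980:2019, n, replace = TRUE)

# Simulated catches: herring tracks sprat only from 2000 onwards.
sprat   = rexp(n)
herring = ifelse(year >= 2000, sprat, 0) + rexp(n)

# Spearman correlation recomputed within each decade.
decade = 10 * (year %/% 10)
rho.by.decade = sapply(split(seq_len(n), decade), function(idx)
  cor(sprat[idx], herring[idx], method = "spearman"))
print(round(rho.by.decade, 2))
```

A shorter window tracks changes more closely but leaves fewer observations per estimate, so the choice of period length is a trade-off.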
The spatial distribution of the catches varies between different species: there are a few locations in which many species are caught (ide, roach, pike, perch, burbot in (60.75, 2.15) and surrounding locations), but there also are species that are caught in completely different places (vendace). In addition, there are species that are caught in just a single location (vendace, ide, roach, pike, perch, burbot) and species that are caught all over the monitored area (trout, salmon).
Laura: Basically, I would expect certain factors, such as temperature or its proxies, to explain a lot of variability in the catches of some species, while the geography may be more important for others. If we can find evidence of the actual water quality variables having an effect, that would be nice.
Below are the temperatures recorded for each species, that is, the temperatures in those observations in which the species catch is strictly positive. Vendace has a profile that is clearly different from the other species. Other differences are more subtle. Some species have a second, smaller peak just above 15°C, while others do not.
If we tally the catches across temperature ranges, we can see some more differences among species. The plots below show the frequencies of the jointly discretised values of temperature (x-axis) and normalised catches (y-axis). Note that the intervals for both variables differ between plots, so they should only be interpreted in relative terms.
For some species, there is a strong association between temperature and catches. Burbot and pike are caught in large quantities only at low temperatures. On the other hand, perch catches appear to increase with temperature, and salmon is caught in large quantities only at higher temperatures.
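A tally of catches across temperature ranges can be sketched with `cut()` and `table()`. The example below bins each variable at its own tertiles, which is a simpler stand-in for the joint discretisation used for the plots; the data are simulated, and the normalisation by the maximum catch is an assumption.

```r
set.seed(3)
n = 300

# Simulated data in which the mean catch grows with temperature.
temperature = runif(n, 0, 20)
catch       = rexp(n, rate = 1 / (1 + temperature))
norm.catch  = catch / max(catch)   # assumed normalisation, for illustration

# Discretise each variable into three intervals at its own tertiles.
temp.bin  = cut(temperature, breaks = quantile(temperature, 0:3 / 3),
                include.lowest = TRUE)
catch.bin = cut(norm.catch, breaks = quantile(norm.catch, 0:3 / 3),
                include.lowest = TRUE)

# Frequencies of each (temperature interval, catch interval) combination.
counts = table(temperature = temp.bin, catch = catch.bin)
print(counts)
```

Because the breakpoints are computed from each sample, they change from one species to the next, which is why such tables are only comparable in relative terms.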
The data contain several variables describing the characteristics of
each location. In addition to the coordinates of its bounding box
(lon.min, lon.max, lat.min,
lat.max), we have the actual sea water.area
and how much of it is open.water and
coastal.zone.
The actual sea area is smaller than that in the bounding box and varies in [75, 3026] depending on the coastline.
The total area within the bounding box in each location varies in [2543, 3120], so the proportion of water area is distributed as follows.
All the missing values for area and water.area are associated with a single location.
unique(data[is.na(data[, "area"]), c("lon", "lat")])
## lon lat
## 44 25.5 59.75
unique(data[is.na(data[, "water.area"]), c("lon", "lat")])
## lon lat
## 44 25.5 59.75
The proportion of open sea varies by location, from almost none to completely open sea. As expected, it is greater in locations farther from the coast. The proportion of coastal area is greater for locations close to the coasts and is the complement to the open sea areas.
The missing values for both variables are associated with the same
location for which we do not observe area and
water.area.
unique(data[is.na(data[, "open.water"]), c("lon", "lat")])
## lon lat
## 44 25.5 59.75
unique(data[is.na(data[, "coastal.zone"]), c("lon", "lat")])
## lon lat
## 44 25.5 59.75
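The relationships between the area variables described above can be checked directly. The sketch below uses the first five values of each variable as displayed in the `str()` output at the end of this report (they may be rounded there), and computes the proportions of water, open sea and coastal area.

```r
# First five values as displayed by str(data); possibly rounded.
area         = c(2543, 2543, 2593, 2593, 2642)
water.area   = c(1295, 115, 2485, 390, 1525)
open.water   = c(700, 0, 2200, 0, 1100)
coastal.zone = c(595, 115, 285, 390, 425)

# Proportion of each bounding box that is actually sea.
prop.water = water.area / area

# Open sea and coastal zone as proportions of the water area.
prop.open    = open.water / water.area
prop.coastal = coastal.zone / water.area

print(round(cbind(prop.water, prop.open, prop.coastal), 3))
```

For these five locations open.water and coastal.zone add up to water.area, so the two proportions are complementary.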
The surface sea temperatures range from freezing to just cool. They vary over seasons and locations, sometimes dramatically and sometimes not. Missing values can be found in all locations and all years/months.
The distribution of the temperatures does not appear to change over time.
Phosphorus and nitrogen are both forms of pollution (from agriculture?) and are related as a result.
Therefore, they have a similar distribution over locations. With over 40% missing values, there are no complete data for any locations or years.
The distribution of nitrogen over locations does not
appear to change over time.
The distribution of phosphorus does appear to
change in the 2020s.
Nitrogen presents a systematic seasonal pattern: in each location, it is higher in winter/spring than in summer/autumn.
A similar pattern appears to hold for phosphorus.
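A seasonal pattern of this kind can be summarised by averaging over months or seasons. The sketch below works on simulated nitrogen values; the grouping of December to May as winter/spring is an assumption made here, and the real `month` variable is a factor rather than an integer.

```r
set.seed(4)

# Simulated monthly measurements over ten years at a single location.
toy = expand.grid(month = 1:12, year = 2000:2009)
winter.spring = toy$month %in% c(12, 1:5)   # assumed definition of the seasons
toy$nitrogen = 300 + 150 * winter.spring + rnorm(nrow(toy), sd = 20)

# Monthly and seasonal means, ignoring missing values.
monthly  = tapply(toy$nitrogen, toy$month, mean, na.rm = TRUE)
season   = ifelse(winter.spring, "winter/spring", "summer/autumn")
seasonal = tapply(toy$nitrogen, season, mean, na.rm = TRUE)
print(round(seasonal, 1))
```

Computing the same summary separately for each location would show whether the pattern holds everywhere.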
The variables turbidity and secchi both
measure how clear the water is, albeit on opposite scales.
Laura: Since they have about the same amount of missing values, I suggest we can drop Secchi and keep turbidity, out of the two.
This relationship is broadly reflected in the values in all locations.
The distribution of turbidity over locations does not
appear to change over time.
Neither does that of secchi.
Turbidity does not present any particular seasonal patterns.
Neither does secchi.
Laura: Secchi would be expected to depend on both turbidity and water colour.
The distribution of the colour number is bell-shaped on a log-scale but somewhat asymmetric, with a short left tail. The reason is that water colour is defined to be positive on the natural scale.
Laura: The colour value is an old technique, traditionally done visually by comparing the water to standardised colour samples.
It’s usually related to humic components that are dissolved in the water and make the water appear brown – in contrast to turbidity, which consists of non-dissolved components (often clay etc.) that scatter light. There is no theoretical upper limit to colour value as far as I’m aware. Waters from bog ponds can reach values up to 200 or so.
Broadly, colour.number takes lower values away from the
coast. It gets higher in the northern locations and is exceptionally
high only in one location. This pattern becomes more apparent if we
count the proportion of observations which take
colour.number values greater than 25 at each location.
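The proportion of observations above the threshold can be computed per location with `tapply()`. This is a minimal sketch on invented values; only the threshold of 25 and the variable names come from the text.

```r
# Toy observations for three rectangles (values are invented).
toy = data.frame(
  rectangle     = c("2", "2", "2", "3", "3", "5", "5", "5", "5"),
  colour.number = c(94.4, 30, NA, 10, 20, 26, 5, NA, 40)
)

# Proportion of non-missing observations with colour.number above 25.
prop.high = tapply(toy$colour.number > 25, toy$rectangle, mean, na.rm = TRUE)
print(round(prop.high, 2))
```

Missing values are dropped before averaging, so each proportion refers to the observed measurements only.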
Secchi is marginally correlated with both turbidity and
colour.number, is correlated with turbidity
given colour.number, and is correlated with
colour.number given turbidity. This suggests
that, indeed, secchi depends on both turbidity and water
colour. (Using Spearman correlation as before.)
On the other hand, turbidity appears to be correlated
with secchi given colour.number but not with
colour.number given secchi. The marginal
correlation between turbidity and
colour.number does not appear to be particularly strong,
either. Therefore, it is likely that turbidity will not be
linked to colour.number even if we drop
secchi.
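The text does not say how the conditional correlations are computed. One common option is a partial Spearman correlation, obtained by ranking the variables and correlating the residuals left after regressing two of them on the third. The sketch below follows that approach; `partial.spearman()` is a hypothetical helper, and the data are simulated so that secchi decreases with both turbidity and colour.number.

```r
# Hypothetical helper: Spearman correlation between x and y given z,
# computed as the correlation of rank residuals.
partial.spearman = function(x, y, z) {
  ok = complete.cases(x, y, z)
  rx = rank(x[ok]); ry = rank(y[ok]); rz = rank(z[ok])
  cor(resid(lm(rx ~ rz)), resid(lm(ry ~ rz)))
}

set.seed(5)
n = 500

# Simulated measurements: secchi depth shrinks as either variable grows.
turbidity     = rexp(n)
colour.number = rexp(n)
secchi = 10 / (1 + turbidity + colour.number) * exp(rnorm(n, sd = 0.1))

marginal    = cor(secchi, turbidity, method = "spearman")
conditional = partial.spearman(secchi, turbidity, colour.number)
print(round(c(marginal = marginal, conditional = conditional), 2))
```

Working on ranks keeps the measure consistent with the Spearman correlations used elsewhere in the analysis.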
Water salinity seems to increase as you move from north-east to south-west, away from Finland and towards Denmark and the open ocean.
The distribution of salinity does not appear to change over time.
Salinity does not present any particular seasonal patterns.
str(data)
## 'data.frame': 20034 obs. of 37 variables:
## $ year : Factor w/ 44 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ rectangle : Factor w/ 42 levels "2","3","5","6",..: 1 2 4 5 7 8 10 11 12 13 ...
## $ herring : num 0 4 0 46 0 ...
## $ sprat : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cod : num 0 314 0 0 0 0 0 0 0 0 ...
## $ flounder : num 0 0 0 0 0 0 0 0 0 0 ...
## $ whitefish : num 1029 293 603 210 8738 ...
## $ salmon : num 0 0 0 0 7 0 0 0 7 5 ...
## $ trout : num 0 0 12 20 68 0 106 95 9 17 ...
## $ smelt : num 0 0 42 660 239 0 0 0 231 28 ...
## $ bream : num 0 0 0 0 77 0 0 0 45 0 ...
## $ ide : num 0 0 105 0 298 105 0 0 2 0 ...
## $ roach : num 0 0 30 52 0 0 0 0 118 0 ...
## $ pike : num 9 41 256 8 421 0 45 7 1450 93 ...
## $ perch : num 44 0 326 91 1234 ...
## $ sander : num 0 0 0 0 0 422 0 0 105 0 ...
## $ burbot : num 2167 0 11 273 48 ...
## $ vendace : num 0 0 0 631 0 0 0 0 0 0 ...
## $ phosphorus : num NA NA NA 22 NA NA NA NA NA 20 ...
## $ colour.number: num NA NA NA 94.4 NA ...
## $ nitrogen : num NA NA NA 502 NA ...
## $ turbidity : num NA NA NA 2.97 NA ...
## $ salinity : num NA NA NA NA NA NA NA NA NA NA ...
## $ secchi : num NA NA NA NA NA NA NA NA NA NA ...
## $ temperature : num NA NA NA 0.138 NA ...
## $ area : num 2543 2543 2593 2593 2642 ...
## $ coastline : num 610 130 90 470 330 200 380 110 1080 190 ...
## $ open.water : num 700 0 2200 0 1100 0 1600 0 1400 0 ...
## $ water.area : num 1295 115 2485 390 1525 ...
## $ coastal.zone : num 595 115 285 390 425 200 465 75 610 175 ...
## $ lon.min : num 24 25 24 25 24 25 23 24 22 23 ...
## $ lon.max : num 25 26 25 26 25 26 24 25 23 24 ...
## $ lat.min : num 65.5 65.5 65 65 64.5 64.5 64 64 63.5 63.5 ...
## $ lat.max : num 66 66 65.5 65.5 65 65 64.5 64.5 64 64 ...
## $ lon : num 24.5 25.5 24.5 25.5 24.5 25.5 23.5 24.5 22.5 23.5 ...
## $ lat : num 65.8 65.8 65.2 65.2 64.8 ...